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Abstract 

We have extended our recently developed 2-step approach for gene-based analysis to the family design and to 
the analysis of rare variants. The goal of this approach is to study the joint effect of multiple single-nucleotide 
polymorphisms that belong to a gene. First, the information in a gene is summarized by 2 variables, namely the 
empirical Bayes estimate capturing common variation and the number of rare variants. By using random effects for 
the common variants, our approach acknowledges the within-gene correlations. In the second step, the 2 
summaries were included as covariates in linear mixed models. To test the null hypothesis of no association, a 
multivariate Wald test was applied. We analyzed the simulated data sets to assess the performance of the method. 
Then we applied the method to the real data set and identified a significant association between FRMD4B and 
diastolic blood pressure (p-value = 8.3 x 1CT 12 ). 



Background 

Testing for the joint effect of single-nucleotide polymorph- 
isms (SNPs) located in a gene is a popular alternative to 
single-marker tests. Single SNP methods are underpow- 
ered because single SNPs have typically small effect sizes 
(common variants) or small minor allele frequencies 
(MAFs) (rare variants). In contrast, approaches that model 
the combined effect of multiple SNPs will be more power- 
ful. Recently, we have proposed a method consisting of 2 
steps [1,2]: (a) the dimensionality of the genetic data is 
reduced, and gene-specific summaries are produced, and 
(b) these summaries are introduced as covariates in the 
model for the phenotype. To model the correlation among 
SNPs within a gene, we use a generalized linear mixed 
model for the SNPs. A gene-level random effect captures 
the correlation within each gene. The empirical Bayes esti- 
mates of the random effects per subject and gene are used 
as summary measures of the SNPs data and are included 
in the phenotype model to test for association. 
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For the Genetic Analysis Workshop 17 (GAW17), we 
studied the performance of this approach for the 
sequence data on the families [2]. The conclusion was 
that common variants were well represented by this gene 
summary, but variation due to rare variants was not well 
captured. For rare variants several collapsing methods 
have been developed (see [3] for an overview). 

Using the GAW18 data, we extend our 2-step method 
to the combined familial and longitudinal setting. Next to 
the empirical Bayes estimates, we will also consider the 
total number of rare variants (MAF < 0.05) to summarize 
the information on rare SNPs within a gene. Using the 
200 simulated data sets, we assess the power of our 
approach for genes with functional loci. We compare 
results obtained by using the original Variant Call Format 
(VCF) files as well as the files in which the genotypes of 
relatives having only genome-wide association (GWA) 
data were imputed. Gene summaries will be computed 
based on all variants and based on only functional var- 
iants. For the simulated data sets, we assume that both 
gene effects are constant over time. Finally, we present 
the results of analyzing the true data set for associations 
with diastolic blood pressure (DBP). In this data set, we 
have identified one significant result when using the VCF 
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files, assuming no interactions between gene summaries 
and time, and 2 borderline significant associated genes 
when we included an interaction between time and gene 
effects to allow gene effects to change over time. Because 
of the computational burden, we consider only chromo- 
some 3. 

Methods 

Study sample 

We considered data for 959 individuals from 20 families. 
Of these, 464 are directly sequenced individuals; for their 
family members, imputed WGS data based on the exist- 
ing GWA framework were available. In this article, we 
restrict ourselves to genotypic data from chromosome 3. 
For each individual, we have information on age at exam- 
ination and current tobacco smoking (yes = 1; no = 0) for 
up to 4 time points for real data and for up to 3 points 
for the simulated data. We analyzed the quantitative trait 
DBP. There are 1274 genes on chromosome 3. From 
these, 927 and 929 genes contain at least one rare variant 
for the imputed files and one for the VCF files. When 
restricting the analysis to functional variants, these num- 
bers are 46 and 43, respectively. To extract genes, we 
used the R packages GenomicFeatures and RSQLite from 
Bioconductor. We used the tool ANNOVAR [4] with the 
UCSC Known Genes database [5] to select the exonic 
SNPs and their predicted effect on the protein. If this 
effect is known and not synonymous (e.g., nonsynon- 
ymous, or stop codon introducing or removal), the SNP 
was considered functional. See Almasy et al [6] for a 
more extended description of the data set. 

Model specification 

Let y ijt be the outcome variable for individual / from 
family i at time point t. For a specific gene g, let w ijgs be 
the genotype at SNP s (s = 1,..., S). The genotype w ijgs is 
coded 0, 1, or 2. For individual /' of family i, let x ijt be 
vectors with covariate values for the phenotypes (age 
and smoking status). 

Gene summaries 

We assume that Hardy Weinberg equilibrium holds. We 
consider a random gene effect to model the correlation 
among SNPs within a gene. Let bj g be the random gene 
effect of gene g for subject /. Given this random effect, Wij gs , 
the number of minor variants for SNP s in gene g is assumed 
to follow a binomial distribution with n = 2 trails and prob- 
ability TT^ S . The probability n ijgs is modeled as follows: 



where bj g follows a normal distribution with zero 
mean and variances a 2 . For each individual and each 



gene, the empirical Bayes estimate is given by eb\jg = bjg- 
Intuitively, the value of the empirical Bayes estimate will 
increase with the number of variants a subject carries. 
These models are fitted using the package lme4 in R. 
Because rare variants are not well captured by the 
empirical Bayes estimates, we also consider the total 
number of rare variants s ijg (MAF <0.05) as a second 
summary measure of the genetic information per 
subject. 

Phenotype model 

Both the empirical Bayes estimates and the number of 
rare variants in a specific gene can be plugged into the 
models for the phenotypes to test for gene effects. To 
model the longitudinal quantitative trait DBP, we use 
the following linear mixed model for each gene g: 

Y ijt = M + fixyt + Yi ebij g + nsijg + Uy + e ijt , (2) 

with Sij g the number of rare variants within gene g k« 
a normally distributed random family effect, and e t j t a 
normally distributed residual with a T times T covar- 
iance matrix to model the correlation between repeated 
measurements within a person (unstructured covariance 
error term). Here T is the number of time points. The 
variance of the family effect u» is equal to variance x 2 
and the correlation of u i =(u il ....Ui ni ) within a family of 
size m, is assumed to be equal to 2 times the kinship 
coefficient between the relatives (polygenic). Estimates 
of all model parameters including the T(T+l)/2 para- 
meters of the unstructured covariance matrix were 
obtained by maximizing the likelihood function using 
the optim function in R [7]. Based on Model (2) we can 
test the null hypothesis of no gene effect, which is 
equivalent to testing the null hypothesis H 0 : Yi=Y2 = 0. 
We used a multivariate Wald statistic with 2 degrees of 
freedom. In addition, to model time-dependent gene 
effects we added interaction terms between the gene 
effects (eb and s) and the time variable. The correspond- 
ing multivariate Wald tests for association of a gene 
(cross-sectional and over time) will have IT degrees of 
freedom; j5-values smaller than 6 x 10' 5 were considered 
to be statistically significant (Bonferroni correction) 
while /^-values smaller than 10~ 4 were considered to be 
borderline significant. 

Results 

Power 

The maximum power was achieved for MAP4 gene 
using the imputed files and restricting the analysis to 
functional variants (96.5% power for the 2 degrees of 
freedom test). For all variants based on imputed files the 
power for this gene was only 36.5%. For functional var- 
iants based on the VCF files the power was 72.5%. 
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Indeed, the total percentage of variance explained by the 
loci for MAP4 was largest among the genes at chromo- 
some 3, namely 0.0648 of all associated genes. 

To show the performance of our methods for various 
genetic effects, we also provide each gene with func- 
tional loci (Supplemental Table 1 of GAW18 answers) 
the percentage of data sets with a p-value smaller than 
0.05 in Table 1. The results for the univariate Wald 
tests and the multivariate Wald tests are given. In addi- 
tion to MAP4 gene, RYBP, ZBTB38, and GPR160 had in 
more than 10% of the data sets a p-value smaller than 
5% for the multivariate test. For RYBP and GPR160, the 
sum of rare variants showed a better performance than 
the empirical Bayes estimate. Indeed, RYBP has 2 func- 
tional loci with MAFs less than 0.05, and GPR160 has 
one functional locus with a MAF less than 0.05. For 
ZBTB238 and MAP4, the empirical Bayes gene summary 
performed better. Gene ZBTB238 has 2 functional loci 
with MAFs less than 0.05. Because the effects are oppo- 
site, the sum score has no power. Gene MAP4 has 12 
functional loci with MAFs less than 0.05. Seven of these 
loci carriers of the rare variant had a smaller DBP, and 



5 loci carriers of the rare variant had a higher DBP than 
noncarriers. 

Overall, the percentage of genes with a significant 
result at the 5% level appeared to be 7.1%. 

Analysis of real data set 

For the real data set, we did not find any significant 
results when we used the imputed files. The smallest 
p-value was 0.002 for gene COX17. When using the 
VCF files, we identified a significant association between 
FRMD4B and DBP (p-value = 8.3 x 10' 12 ). The total 
number of variants in this gene was 2348; 1388 SNPs 
had MAFs smaller than 5%. The p-value for the empiri- 
cal Bayes estimate was 5.3 x 10' 6 and for the number of 
rare variants, 0.057. When using the imputed files, the 
p-value for this gene was only 0.32. 

When we included an interaction term between the 
gene summaries and time, we identified 2 more genes 
showing borderline significance when using the VCF 
files. The genes are MUSTN1 and GTDC2 with p-values 
of 7.5 x 10" 5 and 9.9 x 10 s , respectively. For both genes, 
the association between the rare variants and DBP 



Table 1 Power based on analysis of genes at chromosome 3 in simulated datasets. 



Gene 


Number of variants 


% Variance of largest functional variant 


Power of eb 1 


Power of s 2 


Power of combined 3 


PDCD6I 


466 


0.00040 


0.0 


7.0 


2.5 


DNASE1L3 


115 


0.00014 


2.0 


4.5 


1.5 


PTPLB 


493 


0.00002 


3.5 


8.5 


5.5 


PAK2 


409 


0.00005 


3.5 


0.6 


4.0 


FBLN2 


687 


0.00008 


0.5 


3.0 


0.5 


FLNB 


956 


0.00085 


11.0 


5.0 


7.0 


VPS8 


1042 


0.00008 


3.0 


12.0 


5.0 


RYBP 


347 


0.00041 


6.0 


21.0 


15.5 


ZBTB38 


590 


0.00031 


51.0 


4.0 


34.5 


GPR1602 


44 


0.00020 


3.0 


19.5 


12.0 


SERP1 


18 


0.00002 


0.0 


5.0 


1.5 


SUMF1 


747 


0.00010 


2.0 


1.0 


1.0 


NMNAT3 


559 


0.0001 1 


5.0 


5.5 


6.5 


ARF4 


161 


0.00004 


1.5 


3.5 


2.0 


MAP4 


894 


0.01222 


99.0 


30.5 


97.0 


MLH1 


310 


0.00007 


2.5 


2.0 


1.5 


ARHGEF3 


2223 


0.00007 


2.0 


5.0 


3.0 


PPP2R3A 


1081 


0.00025 


0.5 


1.0 


1.0 


MUC13 


203 


0.00007 


5.0 


0.5 


3.0 


RAD18 


693 


0.00003 


3.0 


3.5 


3.5 


SFMA3F 


134 


0.00004 


12.0 


7.5 


6.5 


BFD 


291 


0.00011 


1.0 


2.5 


1.0 


ABFB1 


48 


0.00053 


3.5 


1.1 


6.5 


B4GALF4 


217 


0.00004 


2.0 


4.5 


3.0 



The values represent the percentages of significant results at 5% level. Imputed data sets were used. 

1 Empirical Bayes estimate H 0 :y, = 0. 

2 Number of rare variants H 0 :y 2 = 0. 

3 HoTi = 0 and y 2 = 0. 
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Figure 1 Estimates of the model parameters y, and y 2 .for 

MUSTN1 and GTDC2, estimates of the model parameters y, and j2 
for empirical Bayes estimates eb and number of rare variants 5 for 
the 4 time points. 



appears to be largest (smallest /^-values). For MUSTN1, 
the effect is most pronounced for the first time points; 
for GTDC2, the association is most significant for the last 
time points. The effect sizes are depicted in Figure 1. For 
GTDC2, the parameter estimates for the eb gene sum- 
mary increase over time. 

Discussion 

We used a 2-step procedure in which, first, the informa- 
tion in a gene is summarized by 2 variables, namely the 
empirical Bayes estimate capturing common variation and 
the number of rare variants. In the second step, these 
summaries are included as covariates in linear mixed mod- 
els. Interactions such as time-gene summary can be 
included. The results of the analysis of real data showed 
that time-gene interactions may identify other genes. 
Another advantage is that this method can deal with miss- 
ing data. An alternative method is generalized estimating 
equations (GEEs). However, existing R-modules do not 
allow for flexible correlation structures, resulting in a con- 
servative approach when the working correlation does not 
agree with the true correlation structure. Moreover, 
whereas the GEE method assumes complete missing at 
random, mixed models are valid under the missing at ran- 
dom assumption. 

Using the linear mixed model without an interaction 
effect between time and gene summaries for the real 
data set, we identified one significant gene when using 
the VCF file, FRMD4B. The association of this gene 
appears to be biologically sound. It has been found to 
be associated with heart failure [8]. Using the real data 



set, we did not find any association using the imputed 
data sets. Probably, the followed imputation procedure 
provided noise. In addition, 2 genes showed borderline 
significance when allowing the effect of the gene sum- 
maries to change over time. 

With regard to the power of the method, the power 
was good only for MAP4. Restriction of the analysis to 
functional variants improved the performance consider- 
ably. This can be explained by the fact that in the simu- 
lation model, only the "functional" loci are associated. 
Note that for the real data, significant results were 
obtained only when all SNPs were used. This shows that 
restricting to functional variants may result in false 
negatives. 

The empirical Bayes summary appeared to perform 
well. The sum of rare variants, however, does not per- 
form well when variants have opposite effects. In addi- 
tion, the sum score does not assign more weights to a 
variant that segregates within the family compared with 
a rare variant that occurs in several families. To include 
this information in testing for association of rare var- 
iants will require future research. 

Conclusions 

The 2-step approach is a flexible method for performing 
a gene-based analysis: it can be used for any design and 
can model time-dependent effects in longitudinal designs. 
For this relatively small sample size, this approach was 
able to detect genes that explain 0.0648% of the variance 
(power of 97%). With regard to the real data set, the asso- 
ciation between gene FRMD4B and DBP was significant 
(p = 8.3 x 10" 12 ). 
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